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Abstract 


The statistical problem of estimating the effective dimension-reduction (EDR) subspace in the 
multi-index regression model with deterministic design and additive noise is considered. A new 
procedure for recovering the directions of the EDR subspace is proposed. Many methods for 
estimating the EDR subspace perform principal component analysis on a family of vectors, say 
By ei . PEs nearly lying in the EDR subspace. This is in particular the case for the structure-adaptive 
approach proposed by Hristache et al. (2001a). In the present work, we propose to estimate the pro- 
jector onto the EDR subspace by the solution to the optimization problem 


minimize m B/ (I—A)By subject to A € Ans, 


giui 


where An» is the set of all symmetric matrices with eigenvalues in [0, 1] and trace less than or equal 
to m*, with m* being the true structural dimension. Under mild assumptions, ,/n-consistency of the 
proposed procedure is proved (up to a logarithmic factor) in the case when the structural dimension 
is not larger than 4. Moreover, the stochastic error of the estimator of the projector onto the EDR 
subspace is shown to depend on L logarithmically. This enables us to use a large number of vectors 
By for estimating the EDR subspace. The empirical behavior of the algorithm is studied through 
numerical simulations. 


Keywords: dimension-reduction, multi-index regression model, structure-adaptive approach, cen- 
tral subspace 


1. Introduction 


One of the most challenging problems in modern statistics is to find efficient methods for treating 
high-dimensional data sets. In various practical situations the problem of predicting or explaining a 
scalar response variable Y by d scalar predictors X!),...,X arises. For solving this problem one 
should first specify an appropriate mathematical model and then find an algorithm for estimating 
that model based on the observed data. In the absence of a priori information on the relationship 
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between Y and X = (X es cae D), complex models are to be preferred. Unfortunately, the accu- 
racy of estimation is in general a decreasing function of the model complexity. For example, in the 
regression model with additive noise and two-times continuously differentiable regression function 
f: R1 — R, the most accurate estimators of f based on a sample of size n have a quadratic risk 
decreasing as n—4/(4+4) when n becomes large. This rate deteriorates very rapidly with increasing 
d leading to unsatisfactory accuracy of estimation for moderate sample sizes. This phenomenon is 
called “curse of dimensionality”, the latter term being coined by Bellman (1961). 

To overcome the “curse of dimensionality”, additional restrictions on the candidates f for de- 
scribing the relationship between Y and X are necessary. One popular approach is to consider the 
multi-index model with m* indices: for some linearly independent vectors 9), ..., Òm» and for some 
function g : R” — R, the relation f(x) = g(0]x,...,0,).x) holds for every x € R?. Here and in 
the sequel the vectors are understood as one column matrices and M' denotes the transpose of the 
matrix M. Of course, such a restriction is useful only if m* < d and the main argument in favor 
of using the multi-index model is that for most data sets the underlying structural dimension m* is 
substantially smaller than d. Therefore, if the vectors 01, ..., Òm* are known, the estimation of f 
reduces to the estimation of g, which can be performed much better because of lower dimensionality 
of the function g compared to that of f. 

Another advantage of the multi-index model is that it postulates that only few linear combina- 
tions of the predictors may suffice for “explaining” the response Y. Considering these combinations 
as new predictors leads to a much simpler model (due to its low dimensionality), which can be suc- 
cessfully analyzed by graphical methods, see Cook and Weisberg (1999) and Cook (1998) for more 
details. 

Throughout this work we assume that we are given n observations (Y;,X1),...,(Yn,Xn) from the 
model 


Y; = f (Xi) + £i = 9(0] Xi,- -Om Xi) +€i, (1) 


where €1,...,€, are unobserved errors assumed to be mutually independent zero mean random vari- 
ables, independent of the design {X;, i < n}. 

Since it is unrealistic to assume that 01,...,0,,*« are known, estimation of these vectors from the 
data is of high practical interest. When the function g is unspecified, only the linear subspace Se 
spanned by these vectors may be identified from the sample. This subspace is usually called index 
space or dimension-reduction (DR) subspace. Clearly, there are many DR subspaces for a fixed 
model f. Even if f is observed without error, only the smallest DR subspace, henceforth denoted 
by S, can be consistently identified. This smallest DR subspace, which is the intersection of all 
DR subspaces, is called effective dimension-reduction (EDR) subspace (Li, 1991) or central mean 
subspace (Cook and Li, 2002). We adopt in this paper the former term, in order to be consistent 
with Hristache et al. (2001a) and Xia et al. (2002), which are the closest references to our work. 

The present work is devoted to studying a new algorithm for estimating the EDR subspace. We 
call it structural adaptation via maximum minimization (SAMM). It can be regarded as a branch 
of the structure-adaptive (SA) approach introduced in Hristache et al. (2001b) and Hristache et al. 
(2001a). 

Note that a closely related problem is the estimation of the central subspace (CS), see Cook and 
Weisberg (1999) for its definition. For model (1) with i.i.d. predictors, the CS coincides with the 
EDR subspace. Hence, all the methods developed for estimating the CS can potentially be applied 
in our set-up. We refer to Cook and Li (2002) for background on the difference between the CS and 
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the central mean subspace and to Cook and Ni (2005) for a discussion of the relationship between 
different algorithms estimating these subspaces. 

There are a number of methods providing an estimator of the EDR subspace in our set-up. 
These include ordinary least square (Li and Duan, 1989), sliced inverse regression (Li, 1991), sliced 
inverse variance estimation (Cook and Weisberg, 1991), principal Hessian directions (Li, 1992), 
graphical regression (Cook, 1998), parametric inverse regression (Bura and Cook, 2001a), SA ap- 
proach (Hristache et al., 2001a), iterative Hessian transformation (Cook and Li, 2002), minimum 
average variance estimation (MAVE) (Xia et al., 2002), nonparametric linear smoothing for inverse 
regression (Bura, 2003), minimum discrepancy approach (Cook and Ni, 2005) marginal high mo- 
ment regression (Yin and Cook, 2006) density based MAVE and outer product of gradient (Xia, 
2007), as well as the refinements using contour-projection (Wang et al., 2008), intraslice covariance 
estimation (Cook and Ni, 2006) and Lasso shrinkage (Ni et al., 2005; Li, 2007). 

All these methods, except SA approach and MAVE, rely on the principle of inverse regression 
(IR). Therefore they inherit its well known limitations. First, they require a hypothesis on the prob- 
abilistic structure of the predictors usually called linearity condition. Second, there is no theoretical 
justification guaranteeing that these methods estimate the whole EDR subspace and not just a part 
thereof, see Cook and Li (2004, Section 3.1) and the comments on the third example in Hristache 
et al. (2001a, Section 4). In the same time, they have the advantage of being simple for implemen- 
tation and for inference. 

The two other methods mentioned above—SA approach and MAVE—have much wider appli- 
cability including even time series analysis. The inference for these methods is more involved than 
that of IR based methods, but SA approach and MAVE need no strong requirements on the design 
of covariates or on the response variable. Moreover, in many cases they provide more accurate 
estimates of the EDR subspace (Hristache et al., 2001a; Xia et al., 2002; Xia, 2007). 

These arguments, combined with empirical experience, indicate the complementarity of differ- 
ent methods designed to estimate the EDR subspace. It turns out that there is no procedure among 
those cited above that outperforms all the others in plausible settings. Therefore, a reasonable strat- 
egy for estimating the EDR subspace is to execute different procedures and to take a decision after 
comparing the obtained results. In the case of strong contradictions, collecting additional data is 
recommended. 

The algorithm SAMM we introduce here exploits the fact that the gradient V f of the regression 
function f evaluated at any point x € R? belongs to the EDR subspace. The estimation of the 
gradient being an ill-posed inverse problem, it is better to estimate some linear combinations of 
Vf(X1),---,Vf(Xn), which still belong to the EDR subspace. 

Let L be a positive integer. The main idea behind the algorithm proposed in Hristache et al. 
(2001a) consists in iteratively estimating L linear combinations ĝß1,...,Bz of vectors Vf(X1),..., 
V f(Xn) and then recovering the EDR subspace from the vectors By by running a principal compo- 
nent analysis (PCA). The resulting estimator is proved to be ,/n-consistent provided that L is cho- 
sen independently of the sample size n. Unfortunately, if L is small with respect to n, the subspace 
spanned by the vectors B,,...,{z may cover only a part of the EDR subspace. Therefore, empirical 
experience advocates for large values of L, even if the desirable feature of \/n-consistency fails in 
this case. 

The estimator proposed in the present work is designed to provide a remedy for this dissension 
between the theory and empirical experience. This goal is achieved by introducing a new method 
of extracting the EDR subspace from the estimators of the vectors B;,...,Bz. If we think of PCA 
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as the solution to a minimization problem involving a sum over L terms, see (5) in the next section, 
then, to some extent, our proposal consists in replacing the sum by the maximum. This motivates 
the term structural adaptation via maximum minimization. The main advantage of SAMM is that 
it allows us to deal with the case when L increases polynomially in n and yields an estimator of 
the EDR subspace which is consistent under a very weak identifiability assumption. In addition, 
SAMM provides a \/n-consistent estimator (up to a logarithmic factor) of the EDR subspace when 
m <4. 

If m* = 1, the corresponding model is referred to as single-index regression. There are many 
methods for estimating the EDR subspace in this case, see Yin and Cook (2005); Delecroix et al. 
(2006) and the references therein. Note also that the methods for estimating the EDR subspace have 
often their counterparts in the partially linear regression analysis, see for example Samarov et al. 
(2005) and Chan et al. (2004). 

An interesting problem in the context of dimensionality reduction is the estimation of the true 
structural dimension m*. Many approaches exist for constructing estimators of m*, see (Li, 1991, 
Section 5), (Xia et al., 2002, Section 2.2), and Bura and Cook (2001b), Bura and Cook (2001a), 
Bura (2003) and Cook and Li (2004) and the references therein. Here we assume that the structural 
dimension is known, leaving the development of an extension to the case of unknown m* for future 
investigation. 

The rest of the paper is organized as follows. We review the structure-adaptive approach and 
introduce the SAMM procedure in Section 2. Theoretical features including ,/n-consistency of the 
procedure are stated in Section 3. Section 4 contains an empirical study of the proposed procedure 
through Monte Carlo simulations. The technical proofs are deferred to Section 5. 


2. Structural Adaptation and SAMM 


Introduced in Hristache et al. (2001b), the structure-adaptive approach is based on two observa- 
tions. First, knowing the structural information helps better estimate the model function. Second, 
improved model estimation contributes to recovering more accurate structural information about 
the model. These advocate for the following iterative procedure. Start with the null structural infor- 
mation, then iterate the above-mentioned two steps (estimation of the model and extraction of the 
structure) several times improving the quality of model estimation and increasing the accuracy of 
structural information during the iteration. 


2.1 Purely Nonparametric Local Linear Estimation 


When no structural information is available, one can only proceed in a fully nonparametric way. A 
proper estimation method is based on local linear smoothing (cf. Fan and Gijbels, 1996, for more 
details): estimators of the function f and its gradient V f at a point X; are given by 


Ax, e 
(Ea) ar | (¥j—a—c' Xj) K(|Xil/b”) 


ilea) GD} Ee) «(Ge). 


where Xj; = X; — Xi, b is a bandwidth and K(-) is a univariate kernel supported on [0,1]. (For a 
vector v, |v| stands for its Euclidean norm.) The bandwidth b should be selected so that the ball 
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with radius b centered at the point of estimation X; contains at least d+ 1 design points. For large 
value of d this leads to a large bandwidth and to a strong estimation bias. The goal of the structural 
adaptation is to diminish this bias using an iterative procedure exploiting the available estimated 
structural information. 

In order to transform these general observations into a concrete procedure, let us describe in 
the rest of this section how the knowledge of the structure can help to improve the quality of the 
estimation and how the structural information can be obtained when the function or its estimator is 
given. 


2.2 Model Estimation When an Estimator of S is Available 


Let us start with the case of known 5S. The function f has the same smoothness as g in the directions 
of the EDR subspace § spanned by the vectors 01,...,0*, whereas it is constant (and therefore, 
infinitely smooth) in all the orthogonal directions. This suggests to apply an anisotropic bandwidth 
for estimating the model function and its gradient. The corresponding local-linear estimator can be 


defined by 
(A) p l E(x) (x,) on) > Yj Ax p)" Wis (2) 


with the weights w;; = K(\IT*X; j|7/h?), where h is some positive real number and TI* is the orthogo- 
nal projector onto the EDR subspace S. This choice of weights amounts to using infinite bandwidth 
in the directions lying in the orthogonal complement of the EDR subspace. 

If only an estimator A of the orthogonal projector TI* is available, a possible strategy is to 
replace IT* by A in the definitions of the weights wij- This strategy is however too stringent, since 
it definitely discards the directions belonging to $t. Being not sure that our information about the 
structure is exact, it is preferable to define the neighborhoods in a softer way. This is done by setting 
Wij = K(X; (1 +p7°Â)X;;/h?) and by redefining 


-E e EnG) ; 


Here, p is a real number from the interval [0, 1] measuring the importance attributed to the estimator 
A. If we are very confident in our estimator A, we should choose p close to zero. 


2.3 Recovering the EDR Subspace from an Estimator of V f 


Suppose first that the values of the function Vf at the points X; are known. Then S is the linear 
subspace of R? spanned by the vectors Vf(X;), i= 1,...,n. For classifying the directions of Rf 
according to the variability of f in each direction and, as a by-product, identifying S, the principal 
component analysis (PCA) can be used. 
Recall that the PCA method is based on the orthogonal decomposition of the matrix W = 
nly", V(X) Vf(X;)': M = OAO? with an orthogonal matrix O and a diagonal matrix A with 
daoia] entries Ay > Ap >... > Ag. Clearly, for the multi-index model with m*-indices, only the 
first m* eigenvalues of .@ are positive. The first m* eigenvectors of .W (or, equivalently, the first 
m* columns of the matrix O) define an orthonormal basis in the EDR subspace. 
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Let L be a positive integer. In Hristache et al. (2001a), a “truncated” matrix 4, is considered, 
which coincides with .@ if L equals n. Let {wr,@= 1,...,L} be a set of vectors of R” satisfying 
the conditions n~! %1 Yeye i = 5p for every 0,0’ € {1,...,L}, with ô «~ being the Kronecker 
symbol. Define 


Be =n! E VAX, (4) 
i=1 


and .@, = £} BB). By the Bessel inequality, it holds z, < .@. Here and in the sequel, for 
two symmetric matrices A and B, A < B means that B — A is positive-semidefinite. Moreover, since 
M Mi, = MM, any eigenvector of M is an eigenvector of 4L. Finally, by the Parseval equality, 
M, = MifL=n. 

The reason of considering the matrix 4; instead of .@ is that 4, can be estimated much better 
than .@. In fact, estimators of ⁄ have poor performance for samples of moderate size because 
of the sparsity of high dimensional data, ill-posedness of the gradient estimation and the non-linear 
dependence of .@ on Vf. On the other hand, estimation of %4; reduces to the estimation of L linear 
functionals of V f and may be done with a better accuracy. The obvious limitation of this approach 
is that it recovers the EDR subspace entirely only if the rank of 4, coincides with the rank of M, 
which is equal to m*. To enhance our chances of seeing the condition rank(.4,) = m* fulfilled, we 
have to choose L sufficiently large. In practice, L is chosen of the same order as n. 

In the case when only an estimator of V f is available, the above described method of recovering 
the EDR directions from an estimator of .4, has a risk of order J/L/n (Hristache et al., 2001a, 
Theorem 5.1). This fact advocates against using very large values of L. We desire nevertheless 
to use many linear combinations in order to increase our chances of capturing the whole EDR 
subspace. To this end, we modify the method of extracting the structural information from the 
estimators By of vectors By. 

Let m > m* be an integer. Observe that the estimator TI, of the projector II* based on the PCA 
solves the following quadratic optimization problem: 


minimize Y 6} (7-11 ĝo subject to r=, tl<m 5 
YB ( J ) ’ ( ) 
£ 


where the minimization is carried over the set of all symmetric (d x d)-matrices. The value m* 
can be estimated by looking how many eigenvalues of II, are significant. Let An be the set of 
(d x d)-matrices defined as follows: 


Am ={A:A=A',OXAXI,trA <m}. 

Define Ag as a minimizer of the maximum of the By (I —A)By’s instead of their sum: 

Am € arg min max B/ (I — A). (6) 

ACAm £ 

This is a convex optimization problem that can be effectively solved even for a large d although a 
closed form solution is not known. It is noteworthy that a solution to (6) is not necessarily a pro- 
jection matrix. In fact, the matrices from An are symmetric positive-semidefinite with eigenvalues 
between 0 and 1 and not just 0 or 1. This enlargement of the search space guarantees its convexity, 
which is needed for the algorithm to be tractable. Moreover, as we will show below, the incorpo- 


ration of (6) in the structural adaptation yields an algorithm having good theoretical and empirical 
performance. 
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3. Theoretical Features of SAMM 


Throughout this section the true dimension m* of the EDR subspace is assumed to be known. Thus, 
we are given n observations (Y1,X1),...,(Yn,Xn) from the model 


Y; = f(X)) +e; = g(0] Xi... Bye Xi) +£, 


where €1,...,€, are independent centered random variables. The vectors 0; are assumed to form 
an orthonormal basis of the EDR subspace entailing thus the representation II* = re 0 oe In 
what follows, we mainly consider deterministic design. Nevertheless, the results hold in the case of 
random design as well, provided that the errors are independent of X1,...,X,. Henceforth, without 
loss of generality we assume that |X;| < 1 for any i=1,...,n. 


3.1 Description of the Algorithm 


The structure-adaptive algorithm with maximum minimization consists of following steps. 


a) Specify positive real numbers ap, an, Pı and hı. Choose an integer L and select a set {we, L < 
L} of vectors from R” verifying |wy|? =n. 


b) Setk=1 and Âo = 0. 
c) Define the estimators Vf,(Xi) for i = 1,...,n by formula (3) with 
wij = K(X; J I+ py Âr- 1) X;;/hz). Set 


= Lhe i) Weis €=1,...,L, 


where Wy, is the ith coordinate of wy. 
d) Define the new value A; by A; € arg minge g, MaXy ÎL —A)Bex. 
e) Set Pk+1 = ap ` Px, hk+1 = an : hy and increase k by one. 
f) Stop if Pk < Pmin or hk > hmax, otherwise continue with the step c). 


Let k(n) be the total number of iterations. We denote by Í, the orthogonal projection onto the space 
spanned by the eigenvectors of Aun ) corresponding to the m* largest eigenvalues. The estimator of 
the EDR subspace i is then the image of Íi. 

Both Ann ji and Í, are estimators of the projector onto S. Our theoretical results are stated for 
the estimator I, but similar results are valid for Akin ), too. The numerical simulations we made 
showed that these two estimators have comparable performances. 

The described algorithm requires the specification of the parameters p1, h1, dp and ap, as well 
as the choice of the set of vectors {wv}. In what follows we use the values 


Pl = 1, Pmin — n/V) a= e7 1/2(3Vm") | 
hy =Con7!/ 0Y) hmaz =2Vd, ay = e!/24vd), 


This choice of input parameters is up to some minor modifications the same as in Hristache et al. 
(2001b), Hristache et al. (2001a) and Samarov et al. (2005), and is based on the trade-off between 
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the bias and the variance of estimation. The constant Co will be chosen in a design-dependent 
manner taking into account the fact that the local neighborhoods used in (2) should contain enough 
design points to entail the consistency of the estimator. The choice of L and that of vectors we will 
be discussed in Section 4. 


3.2 Assumptions 


Prior to stating rigorous theoretical results we need to introduce a set of assumptions. From now on, 
we use the notation 7 for the identity matrix of dimension d, ||A||? for the largest eigenvalue of A TA 
and ||A||2 for the Frobenius norm of A (square root of the sum of squares of elements of A). 

We start with the smoothness assumption ensuring the adequacy of the local linear approxima- 
tion of the regression function. 


(A1) There exists a positive real Cg such that |Ve(x)| < Cg and |g(x) — g(x’) — (x — x’) 'Ve(x)| < 
C,|x — x'|? for every x,x/€ R”. 


Unlike the smoothness assumption, the assumptions on the identifiability of the model and the 
regularity of design are more involved and specific to our algorithm. The formal statements read as 
follows. 


(A2) Let the vectors By € R? be defined by (4) and let B* = Tp = Y} cebe : Zi lel < 1}. There 


exist vectors B byee, Bre € B* and constants u1, ...,Um* such that 
m* = oat 
TT < $ Beby - (7) 
k=1 


We denote u* = wy +... + Um. 


Remark 1 Assumption (A2) implies that the subspace S = Im(II*) is the smallest DR subspace, 
therefore it is the EDR subspace. Indeed, for any DR subspace S', the gradient V f(X;) belongs 
to S' for every i. Therefore By € S' for every £ < L and B* C S'. Thus, for every B° from the 
orthogonal complement S'+, it holds \TI*B°|? < Yi. uilB, B|? = 0. Therefore S'+ c SŁ implying 
thus the inclusion $ C S'. 


Lemma 2 Jf the family {y} spans R”, then assumption (A2) is always satisfied with some py (that 
may depend on n). 


Proof Set ¥ = (wy,...,wz) €R', Vf =(VF(X1),-.-, Vf(Xn)) € R?” and write the d x L matrix 
B = (B1,..-,Bz) in the form Vf -¥. Recall that if M,,M» are two matrices such that M; - M3 is well 
defined and the rank of M2 coincides with the number of lines in M2, then rank(M; - M2) = rank( M1). 
This implies that rank(B) = m* provided that rank(¥) = n, which amounts to span({we}) = R”. 

Let now B1,...,Bm: be a linearly independent subfamily of {ßB¢, < L}. Then the m*th largest 
eigenvalue Am (4) of the matrix A = yea BBT is strictly positive. Moreover, if v1,... ,Vm* are 
the eigenvectors of .W corresponding to the eigenvalues A1 (4) >... > Am (A) > 0, then 


* 
m 1 


M = Y vvi X o Y a veg. = Ane (MM = Pome MO" E BaB. 
k=l Am (M) k=1 k=l 
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Hence, inequality (7) is fulfilled with uz = Am (4)! for every k = 1,...,m*. a 


These arguments show that the identifiability assumption (A2) is fairly weak. In fact, since we 
always choose {wy} so that span({wr}) = R”, (A2) amounts to requiring that the value u* remains 
bounded when n increases. This assumption is much weaker than the coverage assumption under 
which the consistency of the inverse regression based methods is proved. 

Let us proceed with the assumption on the design regularity. Define Př = J and Pf = (I + 

) 


p, TI*)~!/ for every k > 2. Next, set Ze = (MP) ~'X;; and for any d x d matrix U put w(U = 


k k _(k k k k k 
K((Z) Tz), wP U) = K'((ZP) TZ), NO (U) =E ;w® (U) and 


n al 
1 1 
yWU)=) (z0) (z2) wh; (U). 
ij ij 


j=l 


(A3) For some positive constants Cy,Cx,Cx’,Cy and for some a €]0, 1/2], the inequalities 


k 


Wu yw) <cy, i=in 


M= 
Sa 
`~ 
S 
= 
= 
= 


(U) <Cx, J=1,...,n, 


M= 
z 
~= 
S 
— 
= 
~ 


(U) < Cx, J=1,..;n, 


Il 
An 








M= 
z 
~ 
s 
< 
z 
~ 


(U)<Cy  i=1,...,n, 


es 
Il 
hn 


hold for every k < k(n) and for every d x d matrix U verifying ||U — I||2 < a. 


Remark 3 Note that in (A3) we implicitly assumed that the matrices y® are invertible, which may 


be true only if any neighborhood E® (Xì) = {x : |U + pz’ O=) 12 (X; — x)| < hg} contains at least 
d design points different from X;. The parameters hı, Pı, ap and an are chosen so that the volume 
of ellipsoids E® (X;) is a non-decreasing function of k and Vol(E® (X;)) = Co/n. Therefore, from 
theoretical point of view, if the design is random with positive density on |0,1]4, it is easy to check 
that for a properly chosen constant Co, assumption (A3) is satisfied with a probability close to one. 


Ka 


involved in the definition (3), a small full-rank matrix to be sure that the resulting matrix is invertible, 
see Section 4. This observation also implies that the SAMM procedure can not be applied in the 
case where the sample size n is smaller than or equal to the dimension d of predictors. 


gooey 


the matrix 


(A4) The errors {¢;,i < n} are centered Gaussian with variance 0°. 
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3.3 Risk Bounds for the Projection Matrix Estimation 


In this section, we present the main result of the paper assessing the quality of the estimator Í, 
of the projection matrix II* in the asymptotics of large samples. To this end, we assume that the 
kernel K used in (3) is chosen to be continuous, positive and vanishing outside the interval [0, 1]. 
The vectors we are assumed to verify 
max max ‘|< 8 
ae vel be (9) 
for some constant w independent of n. In the sequel, we denote by C,C),... some constants depend- 
ing only on m*,u",Cy,Cy,Cx,Cx,Cy and W. 


Theorem 4 Let assumptions (A1)-(A4) be fulfilled. There exists C > 0 such that for any z € 
]0,2,/log(nL)] and for sufficiently large values ofn, it holds 


A 2 Pk 2 = 
P( t(U- f) > Cno sir ig + EE ) crer 4. HOS. 
n(1 E Cn) n 


where co = WV'dCxCy, tn = O(,/log(Ln)) and Gn = Oltan Er), 
Corollary 5 Under the assumptions of Theorem 4, for sufficiently large n, it holds 


ae | 2-1 3k(n)—5 


nN 2 
P| TL, — O* |p > Cn nr t2 + < Lze 7 + 
(iñ, -mi pa Ga ) 


(|, — 1" ll2) scà Veer Paro 





Proof Easy algebra yields 


IE, -|2 = (0, — 0")? = f- 2r f, + tr" 
< trf, +m* -2r fi," < 2m* —2tr ÎI. 


The equality tr II* = m* and the linearity of the trace operator complete the proof of the first in- 
equality. The second inequality can be derived from the first one by standard arguments in view of 
the inequality |I, — I1*||5 < 2m*. a 


According to these results, for m* < 4, the estimator of the orthogonal projector onto 5 provided 
by the SAMM procedure is \/n-consistent up to a logarithmic factor. This rate of convergence is 
known to be optimal for a broad class of semiparametric problems, see Bickel et al. (1998) for a 
detailed account on the subject. 


Remark 6 The inspection of the proof of Theorem 4 shows that the factor tÈ multiplying the “bias” 
term n~2/3Y"™") disappears when m* > 3. 


Remark 7 The same rate of convergence remains valid in the case when the errors are not neces- 
sarily identically distributed Gaussian random variables, but have a bounded exponential moment 
(uniformly in n). This can be proved along the lines of Proposition 14, see Section 5. 
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3.4 Risk Bound for the Estimator of a Basis of the EDR Subspace 


The main result of this paper stated in the preceding subsection provides a risk bound for the esti- 
mator Ty of II*, the orthogonal projector onto S. As a by-product of this result, we show in this 
section that a similar risk bound holds also for the estimator of an orthonormal basis of S. This 
means that for an arbitrarily chosen orthonormal basis of the estimated EDR subspace $ = In(II,), 
there is an orthonormal basis of the true EDR subspace 5 such that the matrices built from these 
bases are close in Frobenius norm with a probability tending to one. 


Proposition 8 Let the assumptions of Theorem 4 be fulfilled. For any orthonormal basis ® Sees nae 
of the estimated EDR subspace S = Im(II,,) there exists an orthonormal basis 8, ...,8m- of the true 
EDR subspace S = \m(II*) such that, for sufficiently large n, it holds 





a 2(4/m* 1 uF 2 a 
p(18, -012 > Cn g+ (vm*+1)v Eze) ı 3k(n) —5 


gize or p ere) 
n(1— Gn) n 


where ©, (resp. ©) is the d x m* matrix whose jth column is ĝ j (resp. Bj). 


Proof Using the singular value decomposition, we write "6, = UAV', where U and V are 
orthogonal matrices and A is a diagonal matrix. Let us denote by Àj, uj, vj respectively the jth 
diagonal entry of A, the jth column of U and the jth column of V. Since II*O,v jį = Ajuj, we have 
dj = |Ajuj| = IT*,v,| < 1. On the other hand, 


Aj = |1*O,v,| > |O,v;| - (0, - 0*)6,v;] > 1- ||, — T|, 


where we used the fact that |Onv i v} O; Onv j= viv j= 1. Let us define the matrix © as follows: 
© = UldymV |, where Iqxm« is the d x m* diagonal matrix with all diagonal entries equal to one. 
One easily checks that © is orthogonal, that is ©'© = I». Moreover, we have © = 1*6,VA-V', 
where A` is the m* x m* diagonal matrix having a5" as jth diagonal entry. Note that if the norm 
of II, — IT* is less than 1, the eigenvalues À j are strictly positive. In this case, A” is well defined 
and we obviously have II*® = ©. Thus the columns of © form an orthonormal basis of Im(II*). 
Furthermore, we have 


On —Oll2 < || —11°@,,||2 + ||(1* — T1,) || 
< ||U (axm — AV" l2 + || — Tale 


m* 


< (È -DA + |" —Thalle 
j=l 


< (Vrč + DIIT — Tl 
provided that ||I1* —I1,,|| < 1. This implies that for every d € (0, 1) the event {||I1* —T,||2 < d} is 


included in {llOn — @||2 < (v m* + 1)d}. By virtue of this inclusion, the assertion of the proposition 
follows from Corollary 5. E 
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4. Simulation Results 


The aim of this section is to demonstrate on several examples how the performance of the algorithm 
SAMM depends on the sample size n, the dimension d and the noise level o. We also show that our 
procedure can be successfully applied in autoregressive models. Many unreported results show that 
in most situations the performance of SAMM is comparable to the performance of SA approach 
based on PCA and to that of MAVE. A thorough comparison of the numerical virtues of these 
methods being out of scope of this paper, we simply show on some examples that SAMM may 
substantially outperform MAVE in the case of large “bias”. Our results also show that SAMM and 
MAVE provide more accurate estimates of the EDR subspace than inverse regression based methods 
: inverse regression based on Minimum Discrepancy Approach (MDA) introduced in Cook and 
Ni (2005) and Sliced Average Variance Estimation (SAVE) of Cook and Weisberg (1991). In all 
simulations for inverse regression based methods, the number of slices is chosen to minimise the 
risk. 

The computer code of the procedure SAMM is distributed freely, it can be downloaded from 
http://code.google.com/p/samm07/. It requires the MATLAB packages SDPT3 and YALMIP. 
We are grateful to Professor Yingcun Xia for making the computer code of MAVE available to us. 

To obtain higher stability of the algorithm, we preliminarily standardize the response Y and the 
predictors X O), More precisely, we deal with Y; = Y;/oy and Š = diag(Zy)~!/ 2X, where oF is the 
empirical variance of Y, Xy is the empirical covariance matrix of X and diag(Ly) is the d x d matrix 
obtained from Ly by replacing the off-diagonal elements by zero. To preserve consistency, we set 
Bern) = diag(Zy) Be xn), where Bean) is the last-step estimate of By, and define Thin) as the 
solution to (6) with By replaced by Bern): Furthermore, we add the small full-rank matrix Jy) /n to 
Ei (x,)(x,) wi in @). 

In all examples presented below the number of replications is N = 250; for each replication, 
a new sample of the design and the error vector (€;,...,€,) has been generated at random. The 





mean loss ery = iz jer; and the standard deviation v x Ł (er; — ery)? are reported, where er; = 


IAO — 11*|| with TI being the estimator of IT* for jth replication. 


4.1 Choice of {47,4 < L} 


The set {ye} plays an essential role in the algorithm. The optimal choice of this set is an important 
issue that needs further investigation. We content ourselves with giving one particular choice which 
agrees with theory and leads to nice empirical results. 


Let Gj, j < d, be the permutation of the set {1,...,n} satisfying Ke Ke Xe x: Let 
J J 


G7 be the inverse of G ;, that is, 6 (6; '(k) =k for every k = 1,...,n. Define {ye} as the set of 
vectors 


n 


2 
(sin (ED), sii e ”)' 


n 


= 








(cos 72 US) 7 cos ( mie) qT 
- i kia, sah 


normalized to satisfy 1", Wei = n for every 4. It is easily seen that these vectors satisfy conditions 
(8) and span({w}) = R”, so the conclusion of Lemma 2 holds. Above, [n/2] is the integer part of 
n/2 and k and j are positive integers. 
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The idea behind the above described choice of vectors wy is the following: if the design is 
uniformly distributed in [0,1]¢ and H(x) is a function R? — R depending only on one coordinate 
of x, the projections of the vector g = (H(X1),...,H(Xn))' on some of directions wy are nearly 
equal to the Fourier coefficients of H. Indeed, for n odd and for every fixed j, the vectors {ex j = 
(0x(S;'(1)/n),0x(G;'(2)/n),... x(651(n)/n)) 131 < k <n} with {bx} being the trigonometric 
basis (that is $2,(x) = V2sin(2mpx) and $2,41(x) = V2cos(2mpx) for every p € N) form an or- 
thonormal basis of R”. Therefore, for any function H from Rf to R, which depends exclusively on 
the j!” coordinate x) of x, one has g! ex; = 1 (XP olz (i) /n) = My Ho(Xg i )x(i/n) 
for some function Hp : R — R. Since for a sample xë ) aes XV ) drawn from uniform distribu- 
tion in [0, 1] the order statistics are nearly equal to i/n, we get +g! eg; © +", Ho(i/n)ox(i/n) ~ 
(Ho, 9x) 2 0,1); Note that although this explanation is valid only for uniform design and a function 
H depending only on one coordinate, empirical results show that this choice leads to satisfactory 
results in more general situations. 


4.2 Example 1 (Single-index) 
We set d = 5 and f(x) = g(8'x) with 


g(t) =4|t|!/? sin? (zt), and 0 = (1/V5,2/V5,0,0,0)' € R5. 
We ran SAMM, MAVE, MDA and SAVE procedures on the data generated by the model 
Y; = f(X;) +0.5-€;, 


where the design X is such that the coordinates (x ) J <5,i <n) are iid. uniform in [—1, 1], and 
the errors £; are i.i.d. standard Gaussian independent of the design. 

Table 1 contains the average loss for different values of the sample size n for the first step 
estimator by SAMM, the final estimator provided by SAMM and the estimators based on MAVE, 
MDA and SAVE. The first observation is that inverse regression based methods are not consistent 
in this case. We plot in Figure 1 the average loss normalized by the square root of the sample 
size n versus n. It is clearly seen that the iterative procedure improves considerably the quality 
of estimation and that the final estimator provided by SAMM is ,/n-consistent. In this example, 
MAVE method often fails to recover the EDR subspace. However, the number of failures decreases 
very rapidly with increasing n. This is the reason why the curve corresponding to MAVE in Figure 1 
decreases with a strong slope. 


4.3 Example 2 (Double-index) 
For d > 2 we set f(x) = g(8' x) with 


g(x) = (x1 — x9) (xj +22); 


and 8; = (1,0,...,0) € R%, 02 = (0,1,...,0) € R?. We ran SAMM, MAVE, MDA and SAVE 
procedures on the data generated by the model 


Y; = f(X;)+0.1-£;,  i=1,...,300, 
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Figure 1: Average loss multiplied by y/n versus n for the first step (solid line) and the final (dotted 
line) estimators provided by SAMM and for the estimator by MAVE (dashed line) in 
Example 1. 





n 200 300 400 600 800 
SAMM, Ist 0.443 0.329 0.271 0.215 0.155 
(211) (120) (115) (095) (.079) 

SAMM, Fnl 0.337 0.170 0.116 0.076 0.053 
(.273) (147) (104) (054) (.031) 





MAVE 0.626 0.455 0.249 0.154 0.061 
(363) (408) (342) (.290) (.161) 
MDA 0.882 0.885 0.890 0.885 0.882 
(144) (141) (130) (142) (.148) 
SAVE 0.857 0.847 0.832 0.818 0.782 


(145) (144) (154) (168) (169) 





Table 1: Average loss ||I1—II*|| of the estimators obtained by SAMM, MAVE, MDA and SAVE 
procedures in Example 1. The standard deviation is given in parentheses. 


where the design X is such that the coordinates (x ) j <d,i <n) are iid. uniform in [—40,40)], 
and the errors £; are i.i.d. standard Gaussian independent of the design. The results of simulations 
for different values of d are reported in Table 2. 


As expected, we found that (cf. Figure 2) the quality of SAMM, as well as the quality of SAVE, 
deteriorated linearly in d as d increased. This agrees with our theoretical results. It should be noted 
that in this case MAVE and MDA fail to find the EDR subspace. 
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Figure 2: Average loss versus d for the estimators provided by SAMM (dotted line), by MAVE 
(dashed line), by MDA (dash-dot line) and by SAVE (solid line) in Example 2. 





d 4 6 8 10 12 
SAMM Ist 0.154 0.242 0.296 0.365 0.421 
(.063) (.081) (.071) (.087) (.095) 

SAMM, Fnl 0.028 0.048 0.060 0.077 0.098 
(.011) (.020) (.021) (.026) (.037) 





MAVE 0.284 0.607 0.664 0.681 0.693 
(.147) (.073) (.052) (.054) (.044) 
MDA 0.768 0.894 0.938 0.964 0.973 
(.232) (.142) (.095) (.062) (.049) 
SAVE 0.129 0.179 0.222 0.259 0.299 


(.048) (.047) (.050) (.058) (.071) 





Table 2: Average loss ||F1—IT*|| of the estimators obtained by SAMM, MAVE and MDA procedures 
in Example 2. The standard deviation is given in parentheses. 


4.4 Example 3 
For d = 5 we set f(x) = 9(0'x) with 
g(x) = (1 +21) (1 +2) (1 +23) 


and ©; = (1,0,0,0,0), ®2 = (0,1,0,0,0), 83 = (0,0,1,0,0). We ran SAMM, MAVE, MDA and 
SAVE procedures on the data generated by the model 


Y; = f(X) +0-&, i=1,...,250, 
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o 200 150 100 50 25 10 
SAMM ist 0.227 0.177 0.141 0.119 0.113 0.106 
(092) (075) (.055) (051) (.048) (.043) 

SAMM, Fnl 0.125 0.084 0.057 0.039 0.034 0.030 
(.076) (.037) (.026) (.019) (.021) (.018) 

MAVE 0.103 0.087 0.073 0.062 0.063 0.059 
(041) (035) (.027) (.023) (.024) (.023) 

MDA 0.854 0.850 0.867 0.862 0.858 0.873 
(167) (173) (.157) (159) (171) (159) 

SAVE 0.510 0.511 0.496 0.505 0.496 0.490 
(208) (204) (207) (197) (196) (199) 





Table 3: Average loss ||II — TI*|| of the estimators obtained by SAMM and MAVE procedures in 


Example 3. The standard deviation is given in parentheses. 


where the design X is such that the coordinates (x ) j <d,i <n) are i.i.d. uniform in [0,20], and 
the errors £; are i.i.d. standard Gaussian independent of the design. 


Figure 3 shows that the qualities of both SAMM and MAVE deteriorate linearly in 6, when 
© increases. These results also demonstrate that, thanks to an efficient bias reduction, the SAMM 
procedure outperforms MAVE when stochastic error is small, whereas MAVE works better than 


SAMM in the case of dominating stochastic error (that is when o is large). 


0.4 





n — U*|| 











—+— SAMM ist 


— + — MAVE 


SAMM Fnl |- 

















Figure 3: Average loss versus o for the first step (solid line) and the final (dotted line) estimators 
provided by SAMM and for the estimator based on MAVE (dashed line) in Example 3. 
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4.5 Example 4 (Time Series) 


Let now 7;,...,7;,46 be generated by the autoregressive model 
Ti+6 = f (Tiss, Ti+4, M43, Ti, Ti, Ti) + 0.2 ‘Ei, i=1,... MN, 
with initial variables 7,,...,7%@ being independent standard normal independent of the innovations 


€;, which are i.i.d. standard normal as well. Let now f(x) = g(0'x) with 
g(x) = —1 + 0.6xı — cos(0.52x2) +e, 

and 

© = (1,0,0,2,0,0)/V5, 

® = (0,0, 1,0,0,2)/V5, 

03 = (—2,2,—2,1,—1,1)/ v15. 
We ran SAMM and MAVE procedures on the data (X;,Y;), i= 1,...,250, where Y; = T;}6 and 
Xi = (T;,...,T;45)' . The results of simulations reported in Table 4 show that the qualities of SAMM 


and MAVE are comparable, with SAMM being slightly better. SAVE is better than MDA, but both 
of them are far less accurate than SAMM and MAVE. 





n 300 400 500 600 
SAMM, 1st 0.391 0.351 0.334 0.293 
(172) (161) (137) (132) 

SAMM, Fn! 0.220 0.186 0.174 0.146 
(119) (123) (.102) (.089) 





MAVE 0.268 0.231 0.209 0.182 
(.209) (170) (.159) (122) 
MDA 0.914 0.915 0.913 0.912 
(115) (¢107) (119) (.119) 
SAVE 0.617 0.515 0.428 0.369 


(.200) (184) (151) (138) 





Table 4: Average loss ||I1—IT*|| of the estimators obtained by SAMM, MAVE, MDA and SAVE 
procedures in Example 4. The standard deviation is given in parentheses. 


5. Proofs 


Since the proof of the main result is carried out in several steps, we give a short road map for 
guiding the reader throughout the proof. The main idea is to evaluate the accuracy of the first step 
estimators of By and, given the accuracy of the estimator at the step k, evaluate the accuracy of the 
estimators at the step k+ 1. This is done in Subsections 5.2 and 5.1. These results are based on a 
maximal inequality proved in Subsection 5.4 and on some properties of the solution to (6) proved 
in Subsection 5.5. The proof of Theorem 4 is presented in Subsection 5.3, while some technical 
lemmas are postponed to Subsection 5.6. 
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5.1 One Step Improvement 


Let {5;,} be a sequence of positive numbers to be chosen later and let 2; = {A E Am : tr(I—A)TI* < 
öz}. Recall that we use the following notation: 


Baup T EN ts a UER Zz) 


3 
E CS) Orn of 1 1 (k) 
“Eno, vPo-È (210) (A) Po, 


where 4 isa a x d symmetric positive-semidefinite matrix. Let us define Sg = (J+ pr Ârı ) 1/2 and 


One eal checks that the estimator Vf f(X) is given by 


(Ao EEE ween En) 


Simple algebra yields 
hg ' fk(Xi) n 1 
( ee =h; 1y (k (U7 DS Y; (z) wO (Up). 
PEV f (Xi) j=l ij 


In order to study the behavior of V f}, we will proceed in a first step as if Up were deterministic. For 
this reason, the notation 


hy fre(Xi) jel zi (z o) (k) 


will be useful. In fact, V f,(X;) defined as above would be the expectation of Vi f (Xi) if Up were 
deterministic. 


Proposition 9 Let assumptions (A1)-(A4) be fulfilled. If for some integer k € {2,k(n)] the real 
number Qk = 282P + 28-19; | is less than the constant appearing in assumption (A3), then 
there exist Gaussian vectors 5} ,,--+,57 p € R? such that maxı<e<;, EE? 417] < cho? and 





A ap 2 
P |r; 7 a ee Åi E ee 
(pas, k (Bek — Be) — Tale k, An E Pr- ) ey 


where we used the notation Y; = VCvCe(Pe+5p—1) hk +c100gtn/(Vnhg) with t, =4+ (3log(Ln) + 
3d? logn)", co = W(dCxCy)'/? and cı = 159 (CCC + CCR) 
zd“ logn) /4, co = (dCrCy) /4 and cy = 150 (Cy,.Cy Ce + Cy Ck) "^ 


roor Ri us start with evaluating the “bias” term |P% (Bex — Be) |, where the vectors Bex are defined 
as zbi- VS Vfl Xj) Wei. According to the Cauchy-Schwarz inequality, it holds 


2 


Pi (Bex — Be) |? =n? oe i) — VE (Xi)) Wei 


te ETAO X;)) |? Lvi 
< max (Pe (VFX) — VEX)? , 


geseg 





1664 


ESTIMATION OF THE DIMENSION-REDUCTION SUBSPACE 


Simple computations show that 


[Pe (V FX) — VF(Xi))| < 





r AXi 
be a) (evra) 
vr seb) 10> (Fer) 
U È ru(z P Uo] =, 


where rij = f(X;) — f(Xi) — XI V(X). Define vj = VP (U) "? (zo) wO (Up), Aj = 
I riz wh (Up) and A= (Au,..-,An)™. Then Dj vjv) = lapi and 


b(X;) = we ale Ya 
j=l 





a 





< |v Uo]: 


Note now that in view of Lemma 21, ||U; — I||2 < Œg on the event {Âr E Pr—ı}. Therefore, 
b) <n? ve woef be rwhp 


= (k) = (k = 
<a? max rh; (Us) sees max rh 
jwẸ (Ux) #0 j=l jwẸ (Ux) #0 


Let us denote by © the (d x m*) matrix having 8; as Ith column. Then II* = @O' and therefore, in 
view of (A1), 


Iryl = SA) -SX -Xy VIX) 
= |g(@'X;) —g(O'X;) — (@'X;;) Ve(@'X;)| 
<C,|0'X;;|? = C| Xil. 
H are defined via the kernel function K vanishing on the interval [1,°°[, we have 
pe (U,) 40 r= = max{r?, : |SkXi;| < hk}. By Corollary 19, the inequality |S,X;;| < hg implies 
|IT*Xi;| < (Pk + S¢-1)hx. On the other hand, |IT*X;;| < |X| < |SeXij| < hx. These estimates yield 
|b(X:)| < VC Cef (pe + 8x1) A 1}7hy, and consequently, 


Since the oe w 
max 


je \Pé (Bex — Be) | < max b(X;) < SCy Ce{ (Pk +81) A 1 Pg. (9) 


Let us evaluate now the “stochastic” error P; (Bex — Box). Define E; as the d x (d+ 1) matrix (0 T), 
where 0 stands for the vector all coordinates of which are zero and J is the d x d identity matrix. 
Using this notation, we have P* (Bex — Bex) = Li- cjelUk) £j, where 


cje(Uk) = wg BEV ur'(, b)» O Uye. 
Zij 
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Let us define €7 , = V/nhy X;=1 cj e(I) €;. Clearly, the vectors €7 , are centered Gaussian and, in view 
of Lemma 22, they satisfy E[|é% 417] < nho’ Yj |cje(D |? < eGo”. 
By virtue of Lemma 21, on the event {Ay E Y,_;}, for any £= 1,...,L we have 





Xí (cje(U) — cj e(1)) ejl. 


|P; (Bex — Bex) — 


* 
tk 
|< sup 
Vinh \|U- T\|2<O, 


Set aje(U) =cje(U) — cj elI). Eee 23 and inequality (12) imply that Proposition 14 can be 























applied with Kọ = Tih and Kı = Toi T Setting € = 204 / y/n we get that the probability of the event 
n 
c1004(4 + y3 log(Ln) +3d?log(vn)) 
sup cjelU)—cjell)) e; > 
{s È (ei) ~esalD)e ei 
is less than 2/n. This completes the proof of the proposition. E 


Corollary 10 If nL > 6 and the assumptions of Proposition 9 are fulfilled, then 


2 


OCoZ ^ zal 
Ax-1 © Pk 1) Sz aa 


vnhg’ 
In particular, if nL > 6, the probability of the event 








P ( max |P (Bex — By)| > Y4 


20coy/log(Ln) 
vnhg 


does not exceed 3 /n, where Yy and co are defined in Proposition 9. 





{ max [Pi Bea — Bo) > Y4 lati E Pr-ı} 


Proof In view of Lemma 7 in Hristache et al. (2001b), we have 


` 


P( max, |6] 2 zco) < EP (lézel = zeos) < Lee" DP, 


The choice z = ,/4log(nL) leads to the desired inequality provided that nL > 6. a 


5.2 The Accuracy of the First-step Estimator 


Since at the first step no information about the EDR subspace is available, we use the same band- 
width in all directions, that is the local neighborhoods are balls (and not ellipsoids) of radius h. 
Therefore the first step estimator By; of the vector By is the same as the one used in Hristache et al. 
(2001a). 


Proposition 11 Under assumptions (A1), (A3), (A4) and (8), for every £ < L, there exists a d- 
dimensional zero mean Gaussian vector or so that 


i i 
Be —Be- am = | SMC J/Cy, 


and E\&} |? < do*CyCkW’. 
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Proof Since Př coincides by definition with the identity matrix, the arguments used in the proof 
of Proposition 9 apply with Sı = / and therefore do = @ı = 0. More precisely, in view of (9) and 
pı = 1, we have |Be1 — Be| < hiv CvG, for all £, while in view of the relation U; = J, we have 
Bea — Bet = an 81° This yields the desired result. a 


Corollary 12 If nL > 6 and the assertions of Proposition 11 hold, then 


1 
<e 


n 





. 2/dCyCxlog(nL) of 
P —-BI>h H 
(max lĝe: Bel > hC vC myn 


Remark 13 Jn order that the kernel estimator of V f(x) be consistent, the ball centered at x with 
radius hı should contain at least d points from {X;,i = 1,...,n}. If the design is regular, this means 
that hy is at least of order n~'/4. The optimization of the risk of Bie with respect to hı verifying 
hı >n~*/4 leads to hy = Const.n~'/4V), This motivates the choice of hı presented in Section 3. 


5.3 Proof of Theorem 4 


Recall that at the first step we use the following values of parameters: Ag = 0, pı = 1 and hy = 
n—!/(4V4) Let us denote 





20W,/2dCyCx log(nL 
Yı = MCV Cy + Ww JE > ), 8) = 2y1 Ve", 


and introduce the event Q; = {max, Îi — Bel < Yı}. According to Corollary 12 the probability 
of the event Q, is at least 1 —n~!. In conjunction with Proposition 17, this implies that P(tr(7 — 
A) <&)>1-a 1. 

Recall that for any integer k € [2,k(n)|—-where k(n) is the total number of iterations—we use the 
notation Px = dpPx_—1, hk = anhg_1 and O = 282p + 28,-1P; Let us introduce the additional 
notation 








Me am yani Foa k= k(n), 
Ce = Qu" (YEP, + V2 WPr Ce), 
Sk = Yeu / V1 — Gr, 


Or = {max [P Bra — Bo)| < Ya) 


1 ae k < k(n), 


Combining Lemmas 24 and 25, we obtain P(tr(7 — A,_1)T* > 82_,) < P(Q¢_,) and therefore, using 
Corollary 10, we get 
P(Q) < (max |P; (Bex — Be)| > Ye, An—1 € Bier) +P(Qi_1) 


< 


zlu ws 


+P(Q_1),  Yk<k(n)-1. 
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Since P(Q) < 1/n, it holds P(n) < (3k(n) —5)/n and, by virtue of Corollary 10, P(Q; 


Lze~(@-D/2 4. ao, In conjunction with Lemma 25, this yields 


(m) S 


3k(n) -5 dö 
n 


P(tr(I — Ajyn) T" > Skin) < Lze -D24 


According to Lemma 24, we have 84(n)—2 < Px(n)—1> Ok(n)—1 < 4 and Cun) < 1/2. Consequently, 
for n sufficiently large, we have 





VE L A o(a ) 2 yp -2/3Vn 
© i-una — Crn) i 


and O(n) < 48x(n) ~ Pin) < C[(\/log(Ln) (Prin vo) Vee |, Since hg(n) = 1 and (npn) < 


pi, =" = aa A 


, we infer that 


O(ZCo + €1Ox(n)tn) 


Vn 





Ye(n) = CoV Cy (Prin + 84¢n) 1) + 


_ * CoO Z 
<Ct?n 2/(3Vm Pa 
~ yn 


Therefore On := k(n) = O(YE(n)Piny) tends to zero as n tends to infinity not slower than 


log(nL) n~'/(6Y""") and the assertion of the theorem follows from (10), the definition of Sx(n) and 
Lemma 20. 


5.4 Maximal Inequality 


The following result contains a well known maximal inequality for the maximum of a Gaussian 
process. We include its proof for the completeness of exposition. Let Sg_; denote the unit ball of 
R’. 


Proposition 14 Let r be a positive number and let T be a finite set. Let functions a jy : RP — RI 
obey the conditions 


n 
sup sup)? lajy(u)|? < 9, 


YET |u—u*|<r j=1 
2 
< Kj 


n 


d 
sup sup sup Ł a, € tin) 


YET |u—u*|<reeSg-1 j=1 








for some u* € RP. If the €;’s are independent N(0,07)-distributed random variables, then 


Laval u)Ej 


2 
> roxy + 20x ae 
n’ 








P (sup sup 


YEL |u—u*|<r 





where t = ,/3log(|I|(2r/e)?n) and |I] is the cardinality of T. 
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Proof Let B, be the ball {u : |u—u*| < r} C R? and &,¢ be an g-net on B, such that for any 
u € B, there is an element u; € X„ẹ such that |u — u| < €. It is easy to see that such a net with 
cardinality N, < (2r/£)” can be constructed. For every u € B, we denoteny(u) = Li) 4j,y(u) £j. 
Since E(|ny(u)|?) < 076 for any Yy and for any u, we have 


P(|ny(u1)| > toxo) < P(my(u)| >t E(|ny(w)?)) <te-P D2, 


Thus we get 


Nre 
P(sup sup |ny(u)| > roxo) oy E P (Inun) > roxo) <|T\Npete 7? -D/2, 
YET ujeXy¢ yer /=1 





Hence, if t = \/3log(|I'|N,en), then P ( SUP yer SUP ex, Nyu) |> toxo) < 1/n. On the other hand, 


for any u,u' € B,, 


mu) —ny(w’) = sup fe! (ny(w) —ny(w’)) 




















e€Sg-1 
d(e" g 
<|u—u'|*- sup sup (e (u) 
ucB, e€S4—1 du 
n d(e'ajy) E 
= |u —u'|? - sup sup A IN 3 
ucB, e€Sq4—1 L du 4 
The Cauchy-Schwarz inequality yields 
2 2 n n 
Iny(w) — nyu’) | 2 |d(e!ajy) eS, gare 
< sup sup —————— (u E IK € 
|u —u'|? onp? du ( ) L l p? j 








Since P( = ef > 4no”) is certainly less than n~}, we have 


P ( sup sup |ny(u)| > toKo + 2y/noxıe) 





yceT ucB, 
< P( sup sup Myw )| > 1) +P(sup sup My) = Nyu (u))| > 1) 
YED EX TOKO yel ueB, 2/noK\€ 


1 n 2 
<- +P( sup Kiju — u (u)|* Lei > 4no? je”) <=, 
n ucB, j=l n 


and the assertion of proposition follows. a 


5.5 Properties of the Solution to (6) 


We collect below some simple facts concerning the solution to the optimization problem (6). By 
classical arguments, it is always possible to choose a measurable solution A to (6). This measura- 
bility will be assumed in the sequel. 
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In Proposition 15, the case of general m (not necessarily equal to m*) is considered. As we 
explain below, this generality is useful for further developments of the method extending it to the 
case of unknown structural dimension m*. 

The vectors By are assumed to belong to a m*-dimensional subspace S of Rf, but in this subsec- 
tion we do not necessarily assume that Bys are defined by (4). In fact, we will apply the results of 
this subsection to the vectors I* Be. 

For every A € Am, let us define 


R(A) = max 8! (—A)Br, Ân = argminR(A), 
1<£ <L 


AE Am 
= min yR =4/R = min max | — A 1/28 
AEC Am AE An 1 real ) Bel 


We also define 


*(m)= I-A)! 
R*(m) = wa Bel 


and denote by A*, a minimizer of max¢B/ (I —A)By over A € Am. Note also that for every m > m* 
the projector II* belongs to Amn. Therefore, we have Až, = II* and R*(m) = 0 for every m > m*. 


Proposition 15 Let B* = {B = Ze cebe : Ze ce] < 1} be the convex hull of vectors +By. If max¢ IBe— 
Bel < e, then 





max |(I—A,,)'/°B| < 
BEB* 


When m < m*, we have also the lower bound R (m) > (R*(m) —€) 4. 
Proof For every L€ 1,...,L, we have 
I-An) P Bel < IU -AR Be +1 — Ann) '/? (Be — Bo) 
< R*(m) + |ĝÎĝe— Bel < R*(m) +e 
Since Â, minimizes max, |(Z — A)!/2By| over A € Am, we have 


max |(I—Ay,)"/?B| < max (I—Aj,)""?Bel < R(n) +e. 


Since Âm € Am, we have 0 < (I — Âm) 2 < I and consequently, for every £, 


A A A 


|(I— Âm)" Bel < I(T —Am) el +10 — Âm)? (Be — Be)| 
< |(I— Âm)! Bel + Be — Be] < R* (m) +22. 


The second inequality of the proposition follows now from |(J — A,,)!/?B| < max, |(I — Âm)! Bel 
for every B € B*. 

Let us prove the last assertion of the proposition. According to the definition of R*(m), for 
every matrix A € An there exists an index (A) such that |({ —A)!/ *Bycay| > R*(m). In particular, 
|(7—Am)!/?By4,,)| = R*(m) and hence |(7— Âm) Boal 2 (Am) Bean l- Bean — Bea, = 
R*(m) —e. m 
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Remark 16 Proposition 15 can be used for estimating the structural dimension m. Indeed, R(m) < 
e for m > m* and R(m) > (R*(m) — €), for m < m*. Therefore, it is natural to search for the 
smallest value m of m such that the function R(m) does not significantly decrease for m > M. The 
rigorous application of this heuristic argument is currently under investigation. 


From now on, we assume that the structural dimension m* is known and we use the shortened 
notation A instead of A. 


Proposition 17 Ifthe vectors By satisfy (A2) and max, Be —By| < e, then tr(I — Â)II* < 4e*u* and 
tr[(A —II*)?] < 8e2u"*. 


Proof In view of the relations trA2 < trA < m* and tr(II*)* = trII* = m*, we have 
tr(A —II*)? = tr(A? — TI") +. 2tr(7 — A)T* < 2|tr(7 — A)". 


Note also that the equality tr(/ —A)TI* = tr(7 — Â) !"?TI* (1 — A)!/? implies that tr(J —A)T* > 0. Now 
condition (7) and Proposition 15 imply 


A 


t(I — ÂI = (I — Â) (I — A)? 

m* 

<P gtr — Â)" BaB U- Â)? 
k=1 
m* ais PO ; m* 

< Yi Bi (1A) By < (28)? Yow 
k=1 k=1 

and the assertion follows. E 


Lemma 18 Let tr(I — Â)II* < 8? for some & > 0. Then for any x € R? 
|TI*x| < |A!/2x| + 8 |x|. 
Proof In view of the triangle inequality, |I*x| < |TT*A!/2x| + |II*(7—A!/2)x|. On the other hand, 


In — AN?) x]? < | — AM?) [3 [a)? < wf (AN?) - |x. 





For every A € Am, it obviously holds (J — A12}? = I —2A!/? +A < I — A, and hence, trII* (I — 
Al/?)*11* < w IO*(I— A)O*. Therefore, 


trI* (I — Â"ZPT < wi (I — Â)D* = tr(I— Â)II* < 8? 


yielding |IT*x| < |[I*Â!/x| + 8|x| < |A!/2x| + 8|x| as required. a 


Corollary 19 /f for some p € (0,1) and for some x € R4, we have |(I+p~7A)!/2x| <h, then |IT*x| < 
(p+ 4/tr —A)T*)A. 
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Proof The result follows from Lemma 18 and the inequalities |x| < |(I+p~7A)!/?x| < h and 
|[Â"/?x| < p|(I+p~2A)'/2x| < ph. E 


Lemma 20 Let tr(I — Â)TI* < & for some ò € [0,1) and let D, be the orthogonal projection 
matrix in R? onto the subspace spanned by the eigenvectors of A corresponding to its largest m* 
eigenvalues. Then tr(I — Iim )O* < &/(1 — 87). 





Proof Let rj and 6; j i =1,...,d be respectively the eigenvalues and the eigenvectors of A. Assume 
that Îi > eS > we Then A = rL IA 00} and Pin = m Oj). Moreover, Li 8/8) =] 
since {å ae cae is an orthonormal basis of Re . This implies that 
[AIT] < E Ajr Â 1] + Ane Z re] 
j<m* j>m* 
A A A AT d A A 
= E Â; -în ) [OTIT] + åm tr] ô; én l 
gm j=l 
= E Âj- Ame) e887 0] + "Ane. 
igm* 


A 


Since tr[ĝ; 6} TI] = = |T*8d;)|* < 1, we get [AIT] < Yi<m«4j. Taking into account the relations 
Yi<ahj < m*, trIT* = m* and (1 mae ar +D) — Íi, )xI- A, we get Aiea < m*— Y jcn rj < 
tr[(7 — A)T1*] < & and therefore J — Pim < (1 — 8)! (I — Â). Consequently, tr[(J — Pp )II*] < 
(1 —8?)-! [7 — AJIT] < 8? /(1 — 8’). x 


5.6 Technical Lemmas 


This subsection contains five technical results. The first three lemmas have been used in the proof 
of Proposition 9, whereas the two last lemmas have been used in the proof of Theorem 4. 


Lemma 21 For every p € (0, 1] and for every A € Am we have 
||P5 (2+ p-7A)P5 — Ilo < 254p? + 284p 
where P3 = (I+p 71I*)~'? and 8} = tr[( — AJIT]. 
Proof The inequality PX < (J —II*) + pII* implies that 
p|F5(0-+p°2A)P il = IRA- 
<p? |E (4 -m9 |j + 0- (A -T a- n) |] 
+2p||0*(4 - T") - T") ||, 


2 


Since ||B||3 = trBBT < (tr(BB")!/?)? for any matrix B, it holds 
[E (4 — m], = ||" — A)" ||, 


< tr IŽ (I — AI = (I — A) = ô. 


l= 
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By similar arguments one checks that 





|0 -0*)(A - TU- = || -T)AU —-T") ||, < wd -*)A 
= trA —trIT* + trIT*(J— A) < &, 


and 


|T a-a- < a-r) = [E-A] 
< || -A)| = (tTa -Am 
= (tr(Z — A)IT*)!/2 = ô4. 


This leads to the inequality ||P3 (+ p-°A)P; — I||2 < 84(1+p 7) +284p |, which, in view of the 
condition p < 1, yields the assertion of the lemma. | 


Lemma 22 [fos and U satisfy (8) and (A3), then Y';_, leje(U)|? < dCxCyW /(nhz). 


Proof Simple computations yield 





























2 
L ı/ 1 = dC 
} ziv P U) (zo) wip (U) = (EVP (V) E) < < a1) 
j=1 ij N; (U) 
Hence, we have 
1 sly 1\ j 
leje = > E\V; wr i) (U) Wei 
d i ar 3 Zij 
P g (x NS Oral 1 \ pose 
a : BVP W (zo) uwo) 
hee > me L Zij j 
2 
-ıf 1 (k) (k) 
< < ae 2i Ey” wr lo) N; (U)w;; (U). 
Interchanging the order of summation and using inequality (11) we get the desired result. E 


Lemma 23 7f (A3) and (8) are fulfilled, then, for any e € Sq_, we have 




















d i < 2ACRCYCRY 2160C Y 
sup max gg € AV) < < s ri uae l 
U:|U-Il|2<1/2}51n 2 n-hy, nh; 
de! cje(U) 


where £ (e'c;,)(U) is the d x d matrix with entries We 
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Proof In order to ease the notation, we will remove the superscripts (k) in this proof. Thus, we will 


(k) 


write V;, w;; and Z;; instead of V, wid and re By definition of cj, we have 



































sae ON Ts Í AL 3 KA EvU A \|wi(v) i 
du HON S “Wnty & (au ZR E a 
1 7 1 \ dw;;(U) 5 
2 a Èi zT j , 
i nhy ¢ = a W) G) dU We, 2 


= Ai +42, 


where é = E e satisfies |é| < |e] = 1. One checks that d w;;(U)/dU = W;;(U)Z;; Zio where we used 


the notation w;;(U) = K'(Z,,UZ,;). On the one hand, |w;;(U)| - |Z|? = 0 if Z UZij > 1. On the 
other hand, the inequality ||Z — U ||2 < 1/2 implies that 


IZij? < ZU Zy + |Z (I —U)Zijl < ZU Zy + |Zu? M —U la < ZUZ + |Ziyjl°/2. 


Therefore |Z;;|? < 2 for all Z;j verifying Z;;UZj; < 1. Hence, ||d w;;(U)/dU||2 = |w; (U) |- |Zij|? < 
2|w;;(U)| and we get 


BY e ia * APG Ck 
A < =a V; (U ij(U | <p eee 
TEA A <S 
In order to estimate the term A4, remark that the differentiation (with respect to U pq) of the identity 
V,'(U)Vi(U) = lay yields 


ov, 














Simple computations show that 


ƏV; 


nfa 1 ð 
ue =) (2) (2) T j(U) 








Hence, for any a1,a2 € R? !, 


SR OO wom sserth 


Zij 


This relation, combined with the estimate |Z; il? < 2 for all i, j such that w;; # 0, implies the norm 


estimate 
irin 
ef =.. 
| "o (z,) (2) AREA 


< 6ļaı| |a| 3 |v W|I bY) 
= 


da] V7! (U)az 
dU 














< 6C„CẸ lai] |a2|N;(U)~! 
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This yields Ay < 216C2.C}CZW /(nh,)? and the assertion of the lemma follows. a 


Note that under the assumptions of Lemma 23, for some U satisfying ||U —1||2 < ||U —1|l2, it 
holds 


sup |e! (cje(U) —eje(0)| 


e€Sq-1 


lc e(U) —cje(1)| 





decje, -qT 
sup |vee | O)| vec(U —1)| 








e€Sq_-1 dU 
de'cjy x 
< sup |0] Ww -1k 
e€Su-1 dU 2 
V216Y 
< (Che + CHC)? -Iz (12) 


where vec(-) is a matrix operator that stacks the matrix’s columns one by one. In other terms, for 
every d x d matrix M, vee(M) = (ml 4,...,m] ,)' where me ; stands for the j' column of M. 


Lemma 24 There exists an integer no > 0 such that, for every n > no and for all k € {2,...,k(n)}, 
we have 8-1 < Pk, Qk < 4 and Cy < 1/2. 


1/(4V4) = phy and Pyinyhe(n) > Con” /?, the sequence 


40(co,/log(Ln) + citn) 


V2P x(n) Me(n) 


Proof In view of the relations Con 





Sy =4y/ CyCghy + 


tends to zero as n — %. 
We do now induction on k. Since sn — 0 as n — œ and Yı < sn, the inequality 6; = 2y./u* < 
1/2 = p1/V2 is true for sufficiently large values of n. Let us prove the implication 


Ck < 1/2, 


Òr < P 2 
k1 S pk-1/ V2 => a 


Since 1/v2 < e~'/©, the inequality 5,1 < p;/V2 entails that 5,1 < px and therefore Qg < 4. By 
our choice of a, and dp, we have pi hy > pxhy > Px(n)Ak(n)- Therefore, 


Vk z 40(co./log(Ln) + Citn) 
— < 4y CvCePrhr 4 
Px Vnpxhy 
40(co1/log(L t 
TTT E aA OE) ety) be 


VP x(n) k(n) 


Thus, for n large enough, Gg < 1/2 and Y < px/4. This implies that 8; = 2y(1 — Gk)! < px/ V2. 
By induction we infer that 6,-1 < Pr-1/V2 < px and C, < 1/2 for any k = 2,...,k(n)— 1. This 
completes the proof of the lemma. a 
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Lemma 25 [fk > 2 and 6-1 < 1 then Qg—ı C {tr(I —Ay_1)IT* < 8%_ |}. 
Proof Let us denote by By the vector IT* Bets which clearly belongs to S. It holds 


IBex—1 —Bel <Ye-1, 


IBe — Bel < V2Ye-1/Pe-1- 


Set B= yr uiBiB, and B = p ubb, where ĵ; = Lecebe if Bi = Lv cebe, see assumption (A2). 
Since Y|cc| < 1, we have |B;| < max, |By| < IIV fll and |B; — B:| < max |Be — Bel. Therefore 


Pi) (Bex1—BoO|<%1 => 


|B —Bl| < YanllB:BY — BB! || < u“ max 8:8) — BB} | 
i=l 
< u“ max (1B: — Bi? + 216:1- IB: — Bil) 
< p” (2g Pe +2V2%-1P;-ı max |Br|) = Gk- 


and hence, for every unit vector v € S, v! Bv > (v' By — |v" By — v! Šv|) > v' Bv- ||B — B|| > 
1 —Cy_1. This inequality implies that II* < (1 — C,_,)~!B. Thus the vectors By satisfy assumption 
(A2) with u* replaced by u*/(1 — &k-1). Applying Proposition 17 to these vectors we obtain the 
assertion of the lemma. a 
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